This document details the replication instructions for "Selecting More Informative Training Sets with Fewer Observations" by Aaron Kaufman in "Political Analysis".

SYSTEM INFORMATION

This code was run on a Windows Server 2019 remote machine with 256 GB of RAM and 32 cores. The full replication takes approximately one month to run. Note that the Python notebooks, especially in steps 2-4 below, produce more than 1,000 intermediate files occupying more than 40 GB of disk space. We do not include these files in the replication materials.
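
Given the storage requirement noted above, a quick pre-flight check can prevent a multi-week run from failing partway through. The following is a minimal sketch and is not part of the replication code: the 40 GB figure is the lower bound stated above, and the function name enough_disk and its parameters are hypothetical.

```python
import shutil

GB = 1024 ** 3  # bytes per gigabyte (binary)

def enough_disk(path=".", required_gb=40, usage=shutil.disk_usage):
    """Return True if the filesystem holding `path` has at least `required_gb` GB free.

    `usage` defaults to shutil.disk_usage and is a parameter only so the
    check can be exercised without touching a real disk.
    """
    free_bytes = usage(path).free
    return free_bytes >= required_gb * GB

# Example: enough_disk(".") checks the current working directory.
```

Because the intermediate files exceed 40 GB, a larger value for required_gb leaves a safer margin.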

The software environment involves Python 3.9.10 and R 4.2.2.

The Python libraries and their versions (where applicable) are:
os
pandas 1.4.3
numpy 1.21.1
random
math
re
beautifulsoup4 4.11.1
nltk 3.8.1
sklearn 0.0
sentence-transformers 2.2.2
tensorflow 2.11.0
tensorflow_hub 0.12.0
umap 0.1.1
import_ipynb 0.1.4
torch 1.12.0
torchvision 0.13.0
itertools
asyncio 3.4.3
fast-pytorch-kmeans 0.1.6
scipy 1.8.1
argparse
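
Before running the notebooks, it can help to confirm that the installed packages match the versions listed above. The following is a minimal sketch and is not part of the replication materials. It assumes the listed names are the installed distribution names (distribution names can differ from import names), and the helper name check_versions is hypothetical. Only a subset of the pins is shown; standard-library modules such as os, random, math, re, itertools, and argparse have no separate version to check.

```python
from importlib import metadata

# A subset of the versions listed above; extend as needed.
PINNED = {
    "pandas": "1.4.3",
    "numpy": "1.21.1",
    "beautifulsoup4": "4.11.1",
    "nltk": "3.8.1",
    "scipy": "1.8.1",
}

def check_versions(pinned, lookup=metadata.version):
    """Return {name: (expected, found)} for every package that does not match.

    `found` is None when the package is not installed. `lookup` defaults to
    importlib.metadata.version and is a parameter only to ease testing.
    """
    mismatches = {}
    for name, expected in pinned.items():
        try:
            found = lookup(name)
        except metadata.PackageNotFoundError:
            found = None
        if found != expected:
            mismatches[name] = (expected, found)
    return mismatches

# Example: an empty dict from check_versions(PINNED) means every pin matched.
```

The comparison is an exact string match, so builds that append a local suffix to the version string would be reported as mismatches.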

The R libraries and their versions are:
tidyverse 1.3.2
data.table 1.14.6
xtable 1.8.4
stargazer 5.2.3
cowplot 1.1.1
reshape2 1.4.4
stringr 1.5.0



REPLICATION INSTRUCTIONS

To run the full replication, set your terminal working directory to the folder containing this Manifesto and run main.sh. Users who prefer to walk through the code themselves may instead run the scripts in the following order:

Python Steps:
0. Run the install_python_libraries bash script (requires pip)
1a. Run Embeddings_eo.ipynb* -- approx. 2 hours
1b. Run Embeddings_stwts.ipynb** -- approx. 1 hour
2. Run GenDistMatrices.ipynb -- approx. 30 hours
3. Run Indices.ipynb -- approx. 2 weeks
4. Run ReconstructionLoss.ipynb -- approx. 2 weeks
5a. Run MultinomialNB_eo.ipynb**
5b. Run MultinomialNB_stwts.ipynb

R Steps:
6. Run install_libraries.R (only necessary once)
7. Run fig1_performance.R
8. Run figA3_common_ids.R
9. Run figA4_time.R

* Note that the first time Embeddings_eo.ipynb is run, it takes additional time to download and install the embedding models we test.
** The two scripts within each pair (1a/1b and 5a/5b) can be run in parallel. They run in about 2 hours each.
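
The ordering and parallelism described above can be sketched as a small driver. This is an illustration only and is not the contents of main.sh. It assumes that the .py versions of the notebooks sit in the working directory and can be launched with python, that each numbered step must finish before the next begins, and that the R figure scripts are launched with Rscript. The names STAGES, run_stage, and run_all are hypothetical.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor

# Each inner list is one stage; commands within a stage run in parallel.
STAGES = [
    [["python", "Embeddings_eo.py"], ["python", "Embeddings_stwts.py"]],        # steps 1a, 1b
    [["python", "GenDistMatrices.py"]],                                         # step 2
    [["python", "Indices.py"]],                                                 # step 3
    [["python", "ReconstructionLoss.py"]],                                      # step 4
    [["python", "MultinomialNB_eo.py"], ["python", "MultinomialNB_stwts.py"]],  # steps 5a, 5b
    [["Rscript", "fig1_performance.R"]],                                        # step 7
    [["Rscript", "figA3_common_ids.R"]],                                        # step 8
    [["Rscript", "figA4_time.R"]],                                              # step 9
]

def run_stage(commands, runner=subprocess.run):
    """Run every command in one stage concurrently; raise if any command fails."""
    with ThreadPoolExecutor(max_workers=len(commands)) as pool:
        futures = [pool.submit(runner, cmd, check=True) for cmd in commands]
        for future in futures:
            future.result()  # re-raises any exception from the worker

def run_all(stages=STAGES, runner=subprocess.run):
    """Run the stages strictly in order, stopping at the first failure."""
    for commands in stages:
        run_stage(commands, runner=runner)

# To launch the whole pipeline (roughly one month of compute): run_all()
```

Steps 0 and 6 (library installation) are left out because they only need to be run once.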

MANIFESTO

filename		type		purpose
eo_clean_full.csv			raw data 		Raw Executive Orders data set
stocktwits_clean_full.csv			raw data 		Raw Stocktwits data set

eo_roc_mnb.csv			results data			Accuracy results of a Multinomial Naive Bayes classifier on the Executive Orders data set
stwts_roc_mnb.csv			results data			Accuracy results of a Multinomial Naive Bayes classifier on the Stocktwits data set
matrix_of_commons.csv			results data			A matrix comparing the number of documents that the selection methods choose in common for training
time_df.csv			results data			A data set comparing the computation time of each method

featurizers.py		python script		Helper script to produce text features
get_observations.py		python script		Helper script to extract observations to label
observation_selectors.py		python script		Helper script to select observations to label

install_libraries.R		R script		Installs necessary R libraries
fig1_performance.R		R script		Produce Figure 1
figA3_common_ids.R		R script		Produce Figure A3
figA4_time.R		R script		Produce Figure A4
main_figures.R		R script		Sources the three figure scripts above in order

Embeddings_eo.ipynb		python notebook		Generate text embeddings for Executive Orders data
Embeddings_stwts.ipynb		python notebook		Generate text embeddings for Stocktwits data
GenDistMatrices.ipynb		python notebook		Generate distance matrices for simulations
Indices.ipynb		python notebook		Calculate indices for which documents to label
ReconstructionLoss.ipynb		python notebook		Performs the Reconstruction Loss selection method
MultinomialNB_eo.ipynb		python notebook		Estimates accuracy on the Executive Orders data
MultinomialNB_stwts.ipynb		python notebook		Estimates accuracy on the Stocktwits data

Embeddings_eo.py		python script		Generate text embeddings for Executive Orders data
Embeddings_stwts.py		python script		Generate text embeddings for Stocktwits data
GenDistMatrices.py		python script		Generate distance matrices for simulations
Indices.py		python script		Calculate indices for which documents to label
ReconstructionLoss.py		python script		Performs the Reconstruction Loss selection method
MultinomialNB_eo.py		python script		Estimates accuracy on the Executive Orders data
MultinomialNB_stwts.py		python script		Estimates accuracy on the Stocktwits data

install_python_libraries		Bash script		Installs necessary Python libraries
main.sh		Bash script		Runs everything, in order
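
To confirm that a downloaded copy of the replication archive is complete, the file names in the table above can be checked against the working directory. The following is a minimal sketch with a hypothetical helper, missing_files. It assumes all files sit directly in one folder, tests only for existence rather than content, and lists only a subset of the manifest for brevity.

```python
from pathlib import Path

# A subset of the file names from the manifest; extend as needed.
EXPECTED = [
    "eo_clean_full.csv",
    "stocktwits_clean_full.csv",
    "featurizers.py",
    "get_observations.py",
    "observation_selectors.py",
    "install_libraries.R",
    "main_figures.R",
    "main.sh",
]

def missing_files(root=".", expected=EXPECTED):
    """Return the expected file names that are not present directly under `root`."""
    root = Path(root)
    return [name for name in expected if not (root / name).is_file()]

# Example: an empty list from missing_files(".") means every listed file was found.
```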